Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: CLI command to autogenerate JSON Schema for PL, RQ and lineage #4698

Merged
merged 5 commits into from
Jul 5, 2024

Conversation

kgutwin
Copy link
Collaborator

@kgutwin kgutwin commented Jul 4, 2024

This PR adds the command prqlc debug json-schema --schema-type TYPE. When run, this dumps a JSON Schema document for the provided type (currently pl, rq, and lineage).

Example:

github/prql % target/debug/prqlc debug json-schema --schema-type pl | head -30
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "title": "ModuleDef",
  "type": "object",
  "properties": {
    "name": {
      "type": "string"
    },
    "stmts": {
      "type": "array",
      "items": {
        "$ref": "#/$defs/Stmt"
      }
    }
  },
  "required": [
    "name",
    "stmts"
  ],
  "$defs": {
    "Annotation": {
      "type": "object",
      "properties": {
        "expr": {
          "$ref": "#/$defs/Expr"
        }
      },
      "required": [
        "expr"
      ]

The longer-term goal behind adding this as a feature is to find a way to auto-generate type hints for library integrations (Python, TypeScript, etc.) However, this may also be useful as a debugging or documentation tool.

@kgutwin kgutwin requested a review from max-sixty July 4, 2024 15:10
Copy link
Member

@max-sixty max-sixty left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good!

Could we add a small test? A bit like the one in the PR description. There are examples of CLI insta tests by searching for prqlc_command().

We could also add a changelog entry as it's user-facing

Thanks @kgutwin!

@@ -266,6 +274,13 @@ enum Format {
Yaml,
}

#[derive(clap::ValueEnum, Clone, Debug)]
enum SchemaType {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not super important but I wonder if there's a better name for this — it's essentially each intermediate representation — maybe --ir for intermediate representation, or --repr? Or fine to leave and we can change in the future

@kgutwin
Copy link
Collaborator Author

kgutwin commented Jul 5, 2024

Great feedback! I put together a small test and renamed the flag to --ir-type.

As a related question (towards the underlying purpose of this PR) I am wondering what you think about adding automated code generation to this repo, for the purposes of keeping type hints for the bindings in sync with the gradual evolution of PRQL's IRs. In my main project now, we use Pre-commit hooks to trigger a re-run of client code generation whenever our backend's OpenAPI spec is updated. This works really well, as the hooks ensure that a commit with changes to the backend will also include the corresponding changes to the frontend and other client libraries.

If PRQL were to adopt the same general practice, it would involve adding one or two pre-commit hooks:

  1. Optionally, pre-commit could run prqlc debug json-schema any time there's a change to the IR definition, and the output could be stored in the repo. This would allow casual browsers of the repository to read the JSON Schema without needing to run the tool, and/or the schema could be included in the documentation/web site. But if that doesn't sound useful, this step can be bundled together with the next one.
  2. Pre-commit would also then run any appropriate code generation tools (I'm currently looking at datamodel-code-generator for Python, for example) whenever the schema changes. Code generation from schema is valuable because it imposes no extra runtime overhead for bindings, and it stays in lockstep with versioning, which is important for an actively evolving project like this one.

If you think this approach would work for PRQL, I can put it together as a PR for review. If you have other opinions or thoughts, I'm happy to accommodate. Thanks!

@max-sixty max-sixty merged commit 6545978 into PRQL:main Jul 5, 2024
81 of 83 checks passed
@max-sixty
Copy link
Member

(forgive the delay; just got back from vacation)

Optionally, pre-commit could run prqlc debug json-schema any time there's a change to the IR definition, and the output could be stored in the repo. This would allow casual browsers of the repository to read the JSON Schema without needing to run the tool, and/or the schema could be included in the documentation/web site. But if that doesn't sound useful, this step can be bundled together with the next one.

Storing them in the repo is a great idea (one of the reasons we like snapshots a lot!)

pre-commit is a clever way of doing that. One constraint is that cargo run -p prqlc -- debug json-schema won't run in pre-commit-ci, because that doesn't have internet access, which is required for building the crates. So we could instead:

Pre-commit would also then run any appropriate code generation tools (I'm currently looking at datamodel-code-generator for Python, for example) whenever the schema changes. Code generation from schema is valuable because it imposes no extra runtime overhead for bindings, and it stays in lockstep with versioning, which is important for an actively evolving project like this one.

This sounds interesting. I don't have much experience with these so don't know how well they work. IIUC most folks using bindings aren't going deep in the AST (vs. just compiling to SQL), but if this would be helpful + the tools are good — in particular if they don't impose a cost on those just compiling to SQL, would be very open to it.

Thanks @kgutwin !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants